Abstract —In this paper, we propose a novel sorting algorithm
that sorts input integer data elements on-the-fly without any
comparison operations between the data (comparison-free sorting).
We present a complete hardware structure, associated timing
diagrams, and a formal mathematical proof, which show an
overall sorting time, in terms of clock cycles, that is linearly
proportional to the number of inputs, giving a speed complexity
on the order of O(N). Our hardware-based sorting algorithm
precludes the need for SRAM-based memory or complex circuitry,
such as pipelining structures, and instead uses simple registers to
hold the binary elements and the elements’ associated number of
occurrences in the input set, and uses matrix-mapping operations
to perform the sorting process. Thus, the total transistor count
complexity is on the order of O(N). We evaluate an
application-specific integrated circuit design of our sorting algorithm
for a sample sorting of N = 1024 elements of size K = 10-bit using
90-nm Taiwan Semiconductor Manufacturing Company (TSMC)
technology with a 1 V power supply. Results verify that our
sorting requires approximately 4-6 μs to sort the 1024 elements
with a clock frequency of 0.5 GHz, consumes 1.6 mW of power,
and has a total transistor count of less than 750 000.

Index Terms —90-nm TSMC, comparison free, Gigahertz clock 
cycle, one-hot weight representation, sorting algorithms, SRAM, 
speed complexity O(N). 


I. Introduction, Motivation, and Related Work 

SORTING algorithms have been widely researched for
decades [1]-[6] due to the ubiquitous need for sorting in
many application domains [7]-[10]. Sorting algorithms have
been specialized for particular sorting requirements/situations,
such as large computations for processing data [11], high-speed
sorting [12], improving memory performance [13],
sorting using a single CPU [14], exploiting the parallelism
of multiple CPUs [15], and parallel processing for grid computing
in order to leverage the CPU’s powerful computing resources
for big data processing [16].

Due to the ever-increasing computational power of parallel
processing on many-core CPU- and GPU-based processing
systems, much research has focused on harnessing
the computational power of these resources for efficient
sorting [17]-[20]. However, since not all computing domains
and sorting applications can leverage the high throughput
of these systems, there is still a great need for novel and
transformative sorting methods. Additionally, there is no clear
dominant sorting algorithm due to many factors [21]-[24],
including the algorithm’s percentage utilization of the available
CPU/GPU resources, the specific data type being sorted, and
the amount of data being sorted.

To address these challenges, much research has focused
on architecting customized hardware designs for sorting
algorithms in order to fully utilize the hardware resources and
provide custom, cost-effective hardware processing [2]-[27].
However, due to the inherent complexity of the sorting
algorithms, efficient hardware implementation is challenging.
To realize fast and power-efficient hardware sorting, a
significant amount of hardware resources is required, including,
but not limited to, comparators, memory elements, large global
memories, and complex pipelining, in addition to complicated
local and global control units.

Most prior hardware sorting designs are implemented
based on some modification of traditional mathematical
algorithms [28]-[31], or are based on some modified
network of switching structures [32]-[34] with partially parallel
computing processing and pipelining stages. In these sorting
architectures, comparison units are essential components that
are characterized by high power consumption and feedback
control logic delays. These sorting methods iteratively move
data between comparison units and local memories, requiring
wide, high-speed data buses, involving numerous shift, swap,
comparison, and store/fetch operations, and have complicated
control logic, all of which do not scale well and may need
specialization for certain data-type particulars. Due to the inherent
mixture of data processing and control logic within the sorting
structure’s processing elements, designing these structures can
be cumbersome, imposing large design costs in terms of area,
power, and processing time. Furthermore, these structures are
not inherently scalable due to the complexity of integrating
and combining the data path and control logic within the
processing units, thus potentially requiring a full redesign for
different data sizes, as well as complex connective wiring with
high fan-out and fan-in in addition to coupling effects, which
makes circuit timing issues challenging to address. Additionally,
if multiple processors are used along with pipelining stages
and global memories, the data must be globally merged
from these stages to output the complete final sorted data
set [35], [36].

To address these challenges, in this paper, we propose
a new sorting algorithm targeted for custom, IC-designed
applications that sort small- to moderate-sized input sets,
such as graphics accelerators, network routers, and video
processing DSP chips [12], [33], [44], [46]. For example,
graphics processing uses a painter unit that renders objects
according to the object’s depth value such that the object
can be displayed in the correct order on the screen. In video
processing, fast computation is required for small matrices
in a frame in order to increase the resolution using digital
filters that leverage sorting algorithms. Even though we
present our design based on these scenarios, our design also
supports processing large input sets by subsequently processing
the data in multiple, smaller input sets (i.e., in sets of
N < 100 000) using fast computations, and then merging
these sets. However, since applications with larger input
sets (on the order of millions) are usually embedded into
systems with large computational resources, such as data mining
and database visualization applications running on
high-performance grid computing and GPU accelerators [17]-[20],
these applications can harness those powerful resources for
sorting.

Our sorting algorithm’s main features and contributions
are as follows.

1) Our design affords continuous sorting of input element
sets, where each set can hold any type and distribution
(ordering) of data elements. Sorting is triggered
with a start-sort signal and sorting ends when a
done-sorting signal is asserted by the design, which
subsequently begins sorting the next input set, thus affording
continuous, end-to-end sorting.

2) Our sorting design does not require any
ALU-comparisons/shifting-swapping, complex circuitry,
or SRAM-based memory, and processes data in a
forward moving direction through the circuit. Our
design’s simplicity results in a highly linearized
sorting method with a CMOS transistor count that
grows on the order of O(N). Hence, the design
provides low-power, efficient components, with
regularity and scalability as key structural
features, which enable easy and quick migration to
embedded micro-controllers and field-programmable
gate arrays (FPGAs).

3) The sorting delay time is always linearly proportional
to the number of input data elements N, with upper
and lower bounds of 3N and 2N clock cycles,
respectively, giving a linear sorting delay time of O(N).
This sorting time is independent of the input elements’
ordering or repetition since the design always performs
the same operations within these bounds, as opposed to
Quicksort and other sorting algorithms, which have a large
and nonlinear margin between their bounds.

The remainder of this paper is organized as follows. Section II
summarizes related works and the works’ cost-performance
bottleneck tendencies. Section III discusses our proposed
comparison-free sorting algorithm with illustrative examples
and Section IV provides a mathematical analysis.
Section V details the hardware data path and control logic
implementations along with timing diagrams. Section VI
presents our simulation results, and Section VII discusses our
conclusions, which elaborate on the overall results and our
design’s hardware advantages.

II. Related Work 

In order to provide high scalability, it is critical to design a
sorting method with timing and circuit complexity that scales
linearly with the number of input elements N [i.e., the circuit
timing delay and circuit complexity are on the complexity
order of O(N)]. Although some recent works showed linear
scalability, these works’ O(N) notations hide a large scalar
value [4], [27], [32], [34] and these methods have expensive
circuit complexity with respect to multiprocessing, local and
global memories, pipelining, and control units with special
instruction sets, in addition to high-cost technology power
factors.

Other recent works [2], [25], [37]-[42] divide the sorting
algorithm design into smaller computation partitions, where
each partition integrates control logic and the partition’s
comparison operations with feedback decisions from neighboring
partitions. A global control unit coordinates this control
to streamline the data flow between the partitions and the
partitions’ associated memories to store temporary data that
is transferred between partitions. In addition to the complex
circuitry required to maintain inter-partition connectivity and
redundant intra-partition control circuitry, a complex global
memory organization is required.

Alternative methods [43]-[45] attempt to eliminate
comparators by introducing a rank (sorted) ordering
approach. In [43], a bit-serial sorter architecture was
implemented based on a rank-order filter (ROF), but
comparators were still used to transform the programmable
capacitive threshold logic (CTL) to a majority voting decision.
That design used large array cells of ROF and CTL decisions
with a pipelined architecture. The design in [44] counted the
number of occurrences of every element in the unsorted input
array, where the rank of each element was determined by
counting the number of elements less than or equal to the
element being considered. Thus, the comparison units were
replaced by counting units with bit comparison. However,
the design required a complicated hardware structure with
pipelining and a histogram counting sequence. Alternatively,
the design in [45] used a rank matrix that assigned relative
ranks to the input elements, where the highest element had
the maximum rank and the lowest element had the lowest
rank of 1. The rank matrix was updated based on the value of
a particular bit in each of the N input elements, starting with
the most-significant bit. This bit-wise inspection required
inspecting a complete column of the rank matrix in order
for the lower ranks to update the higher ranks. However, that
design could not be used when the number of elements was
less than the elements’ bit-width.

Some recent works [47]-[49] leverage previous works and
integrate several different sorting architectures for different
requirements, such as speed, area, and power. The work in [47]
leveraged a bitonic sorting network to more efficiently map the
methodology considering energy and memory overheads for
FPGA devices. Further advances of that work [48] presented
novel and improved cost-performance tradeoffs, as well as
identification of some Pareto optimal solutions trading off
energy and memory overheads. Additional work [49]
developed a framework that composes basic sorting architectures
to generate a cost-efficient hybrid sorting architecture, which
enabled fast hardware generation customized for
heterogeneous FPGA/CPU systems.

Even though all of these designs reported linear sorting
delay times as the number of input elements increased,
the authors did not include the initialization times for the
required arrays/matrices, nor was the worst case sorting time
evaluated. Furthermore, each design either required arrays
to store the input elements, associated arrays for the rank
operations and data routing, or had to globally merge the
intermediate sorted array partitions. These array elements
required a significant amount of local and global input-output
data routing, SRAM-based memory, and control signals, where
the local control logic communicated with each processing unit
partition and the global control unit. This layout complicates
adapting the design to different input data bit-widths.
Additionally, since the control signals and data path wiring were
intertwined, circuit design bugs were challenging to locate,
in turn leading to a high-cost design.

III. Comparison-Free Sorting Algorithm 

The input to our sorting algorithm is a K-bit binary bus,
which enables sorting N = 2^K input data elements. The
sorting algorithm operates on the element’s one-hot weight
representation, which is a unique count weight associated with
each of the N elements. For example, “5” has a binary
representation of “101,” which has a one-hot weight representation
of “100000.” For a complete set of N = 2^K data elements,
the one-hot weight representation’s bit-width H is equal to
the number of possible unique input elements. For example,
a K = 3-bit input bus can sort/represent N = 8 elements,
so each element’s one-hot weight representation is of size
H = 8-bit (i.e., H = N). The binary to one-hot weight
representation conversion is a simple transformation using
a conventional one-hot decoder. Using this one-hot weight
representation method ensures that different elements are
orthogonal with respect to each other when projected into
an R^n linear space.
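
The binary to one-hot conversion described above can be illustrated with a short Python sketch. The function name `one_hot` and its `width` parameter are assumptions introduced only for this illustration; the design itself uses a hardware one-hot decoder. The sketch pads the representation of “5” to the full 8-bit width of a K = 3 bus and checks that the representations of distinct values share no set bit, which is the orthogonality property noted above.

```python
def one_hot(value: int, width: int) -> str:
    """Return the one-hot weight representation of `value` as a bit string.

    Bit position `value` (counted from the right, starting at 0) is set and
    every other bit is 0.
    """
    if not 0 <= value < width:
        raise ValueError("value must lie in [0, width)")
    return format(1 << value, f"0{width}b")


K = 3                      # input bus width in bits
WIDTH = 1 << K             # one-hot bit-width, equal to N = 2^K = 8
codes = [one_hot(v, WIDTH) for v in range(WIDTH)]

print(one_hot(5, WIDTH))   # 00100000

# Distinct values never share a set bit, so their bitwise AND is zero.
orthogonal = all(
    int(a, 2) & int(b, 2) == 0
    for i, a in enumerate(codes)
    for j, b in enumerate(codes)
    if i != j
)
print(orthogonal)          # True
```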

For brevity of discussion and ease of understanding our
sorting method’s mathematical functionality, we illustrate a
small example in Fig. 1, which is based on linear algebra
vector computations. This example shows our sorting
algorithm’s functionality using four 2-bit input data elements,
with an initial (random and arbitrary) sequential ordering
of [2; 0; 3; 1], which generates the outputted elements in the
sorted matrix = [3; 2; 1; 0]. This sorting matrix is in
descending order; however, the elements can also be represented in
ascending order by having the mapping go from the bottom
row to the upper row.




Fig. 1. Comparison-free sorting example using four 2-bit input data elements. 

This example operates as follows. The inputted elements are
inserted into a binary matrix of size N × K, where each element
is of size K-bit (in this example N = 4 and K = 2 bit).
Concurrently, the inputted elements are converted to a
one-hot weight representation and stored into a one-hot matrix
of size N × H, where each stored element is of size H-bit
and H = N, giving a one-hot matrix of size N-bit × N-bit. The
one-hot matrix is transposed to a transpose matrix of size
N × N, which is multiplied by the binary matrix, rather than
using comparison operations, to produce the sorted matrix.
For repeated elements in the input set, the one-hot transpose
matrix stores multiple “1s” (equal to the number of occurrences
of the repeated element in the input set) in the element’s
associated row, where each “1” in the row maps to identical
elements in the binary matrix, an advantage that will be
exploited in the hardware design (Section V). For example,
if the input set matrix is [2; 0; 2; 1], then the transpose matrix
is [0 0 0 0; 1 0 1 0; 0 0 0 1; 0 1 0 0]. Notice that
the second row contains two “1s,” such that when this row
is multiplied by the binary matrix,
both “1” occurrences in the row are mapped to
the “2” entries in the binary matrix. Therefore, the multiply operation
can be simply replaced with a mapping function using a
tri-state buffer (Section V). Additionally, the first row in
the transpose matrix contains no “1s”
(i.e., element 3 is not in the binary matrix since 3 is not in the
input set). The absence of this element can be recorded using
a counting register for each inputted element (Section V), and
this register records the number of occurrences of this element
in the binary matrix, which in this case would be “0” for
element 3.
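
The mapping view of this example can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions, not the hardware of Section V: the helper names `one_hot_transpose` and `map_sort` are hypothetical, and rows are ordered from the largest value down to 0 so that reading the rows top to bottom gives the descending order used in Fig. 1. Each “1” in a row selects the input element at that position, which stands in for the multiply operation.

```python
def one_hot_transpose(elements, n):
    """Build the n x len(elements) transpose matrix.

    Row 0 corresponds to the largest value (n - 1) and the last row to 0.
    """
    rows = [[0] * len(elements) for _ in range(n)]
    for position, value in enumerate(elements):
        rows[n - 1 - value][position] = 1
    return rows


def map_sort(elements, n):
    """Sort by mapping every 1 in the transpose matrix to its input element."""
    ordered = []
    for row in one_hot_transpose(elements, n):
        for position, bit in enumerate(row):
            if bit:  # a 1 selects the element stored at this position
                ordered.append(elements[position])
    return ordered


print(map_sort([2, 0, 3, 1], 4))   # [3, 2, 1, 0]
print(map_sort([2, 0, 2, 1], 4))   # [2, 2, 1, 0]
```

For the input set [2; 0; 2; 1], `one_hot_transpose` returns the rows 0000, 1010, 0001, and 0100, matching the transpose matrix given in the text.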

For more insight on this algorithm, Fig. 2 shows C-code
for a single-threaded implementation on a single CPU, where
the transpose matrix is used as a vector matrix instead of a
2-D matrix such that the indices of the TM_{N×1} matrix record
the counting elements of size N × 1. Hence, the initialization
phase, which is structured in the first loop, requires less
memory access time for the reads and writes in the loop
body. The evaluation phase is conducted in the second loop,
and in this phase, the elements are sorted and stored in the
sorted vector SS_{N×1}. The elements in the array vector TM_{N×1}
are read sequentially, and concurrently the elements in the
sorted vector SS_{N×1} are written sequentially, resulting in good
spatial locality in the second loop of the C-code. Due to these
structural designs, our initial simulation results for a
single-threaded single CPU, shown in Fig. 3, reveal
the execution-time advantages of our proposed algorithm
over other popular sorting algorithms, such as Quicksort and
the other standard sorting algorithms reported in [50].

Fig. 2. Comparison-free sorting C-code for a single-threaded single CPU.

Fig. 3. Execution comparisons for our comparison-free sorting design,
Quicksort, merge sort, and radix sort.
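
The two-loop structure described above can be sketched as follows. This is a Python illustration of the idea only; it is not the C-code of Fig. 2, and the function name `comparison_free_sort`, the parameter `k_bits`, and the list `counts` (standing in for the N × 1 count vector) are assumptions made for the example.

```python
def comparison_free_sort(elements, k_bits):
    """Sort non-negative integers below 2**k_bits without comparing elements."""
    n = 1 << k_bits
    counts = [0] * n  # one counter per possible value (the N x 1 vector)

    # Initialization phase: a single pass that updates one counter per input.
    for value in elements:
        counts[value] += 1

    # Evaluation phase: counters are read sequentially and the output is
    # written sequentially, which gives the good spatial locality noted above.
    ordered = []
    for value in range(n):
        ordered.extend([value] * counts[value])
    return ordered


print(comparison_free_sort([2, 0, 3, 1], 2))   # [0, 1, 2, 3]
print(comparison_free_sort([2, 0, 2, 1], 2))   # [0, 1, 2, 2]
```

The sketch emits ascending order; iterating over `range(n - 1, -1, -1)` in the second loop would give the descending order of Fig. 1.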

IV. Mathematical Analysis 

In this section, we provide the mathematical proof for 
our sorting algorithm illustrating the case of N unique input 
elements as a proof of concept. We present this case as the base 
case proof for our sorting algorithm since other input element 
set cases (i.e., different numbers of duplicated elements) can 
be easily derived from this case. 
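
Before the formal argument, the claim can be checked numerically on a small case. The Python sketch below is an illustration under stated assumptions: the helper names `indicator_matrix` and `row_times_matrix` are hypothetical, and the list [6, 3, 4] is the example used later in this section. For a list of distinct positive integers, it builds the matrix with a “1” in row r and column s exactly when the rth list entry equals s, multiplies the list (as a row vector) by that matrix, and then drops the zero entries.

```python
def indicator_matrix(values):
    """Return the k x M matrix whose (r, s) entry is 1 exactly when values[r] == s.

    Columns are numbered s = 1..M with M = max(values); column s is stored
    at list index s - 1.
    """
    m = max(values)
    return [[1 if a == s else 0 for s in range(1, m + 1)] for a in values]


def row_times_matrix(row, matrix):
    """Multiply a 1 x k row vector by a k x M matrix."""
    columns = len(matrix[0])
    return [
        sum(row[r] * matrix[r][s] for r in range(len(row)))
        for s in range(columns)
    ]


L = [6, 3, 4]                       # distinct positive integers
J = indicator_matrix(L)
LJ = row_times_matrix(L, J)

print(LJ)                           # [0, 0, 3, 4, 0, 6]
print([b for b in LJ if b != 0])    # [3, 4, 6]
```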

Let

$$L = [a(1), \ldots, a(k)] \tag{1}$$

be a given list¹ of k positive integers and let

$$M = \max[a(1), \ldots, a(k)]. \tag{2}$$

Let J = J_L be the (k × M) matrix whose entries are
defined by

$$J_{r,s} = \begin{cases} 1, & \text{if } a(r) = s \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

Thus, if s does not belong to L (i.e., there is no r such
that a(r) = s), then the sth column of J will contain all “0s.”
If s belongs to L, then the sth column of J will have “1s” in
exactly the locations r where a(r) = s.

Supposing that L had no repetitions, let

$$LJ = [a(1), \ldots, a(k)]\, J = [b(1), \ldots, b(M)] \tag{4}$$

which gives

$$b(s) = \begin{cases} s, & \text{if } s \in L \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

If s ∉ L, then all of the values in the sth column C_s of J
are “0s,” and b(s) = L · C_s = 0. If s ∈ L, and if r is the
unique value for which a(r) = s, then all of the values in the
sth column C_s of J are “0” except for the value in the rth
row, which is “1.” Therefore, b(s) = L · C_s = a(r) = s,
which proves our claim.

For example, starting with L = [6, 3, 4], then J = J_L would
be the matrix

$$J = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix} \tag{6}$$

and

$$LJ = [0, 0, 3, 4, 0, 6]. \tag{7}$$

Let J* be the matrix obtained by deleting the zero columns
from J such that

$$LJ^{*} = [3, 4, 6]. \tag{8}$$

¹ A list is a set in which repetition is allowed.

Fig. 4. Block diagram of the hardware structure for our sorting algorithm.

V. Hardware Functionality Details


The overall hardware structure for our sorting algorithm is
divided into two parts: the data path and the control unit.
Fig. 4 depicts the input-output signals of a complete block
diagram for our sorting algorithm, which sorts N = 2^K
input data elements. The basic design architecture operates in
two sequential phases: the write-evaluate phase (Section V-A1)
followed by the read-sort phase (Section V-A2). The control
unit (Section V-B) is a simple state machine that controls the
data path’s phases using only a few D-type flip-flop (DFF)
components. Sorting begins when the START-EXT signal is
asserted, and the design signals that sorting has completed by
asserting the FINAL-EXT signal.

Fig. 5. Hardware flow for the write-evaluate phase.

A. Data Path Operation 

The data path contains several circuit components: a one-hot
decoder, register arrays, a serial shifter, a parallel counter (PC),
tri-state buffers and multiplexors, a one-detector, and an
incrementor/decrementor circuit. In order to meet the setup-hold
delay time between the clock and data stabilization for the
elements’ storage registers, the delay elements are built as
a cascade of an even number of inverters. These circuit
components are standard CMOS circuit components [51]-[53],
which are commonly used for advanced
CMOS technologies beyond 90 nm, making our design
scalable to more advanced, low-cost CMOS technologies.

Before proceeding with a more detailed circuit structure of
the write-evaluate and read-sort phases, we present generalized
and overall illustrations for these phases in the flow charts
in Figs. 5 and 6, respectively. The rectangles represent the
operations during each clock cycle event, in which two events
occur per clock cycle, one on each clock edge (i.e., rising
and falling). The steps within the rectangles show the
sequences of the operations based on the data hardware flow
shown in Figs. 7 and 9, where some operations have the same
number, indicating parallelism/independence between these
operations within the clock cycle, meaning that it does not
matter which operation occurs first. Additionally, these flow charts
adhere to the timing constraints depicted in Figs. 8 and 10,
respectively, where each event occurs at a clock edge. The
diamonds are the condition expressions that change the data
flow based on control flow events.

1) Write-Evaluate Phase: During the write-evaluate phase,
each binary input element is converted to the element’s one-hot
weight representation by the one-hot decoder. The decoder’s
output enables an associated register in a register array to
record the binary input element’s occurrence. We refer to
this register array as the order register (ORi) array, where the
ith register stores the ith input element. Each register is a
simple DFF register of size K-bit. This operation is equivalent
to the recording of the element in the transposed matrix in our
algorithm (Section III). Simultaneously, the one-hot decoder
enables an associated register in another register array,
the flag register (FRi) array, which records the number of
occurrences of this element in the input set. For each
occurrence of a duplicated element, the associated flag register is
triggered, and the occurrence is recorded by incrementing
the register’s stored value using a 10-bit incrementor. This
operation is equivalent to having multiple “1s” for repeated
elements in a row in the transpose matrix (Section III).

Fig. 6. Hardware chart for read-sort phase.

Fig. 7. Detailed block diagram of our sorting algorithm’s write-evaluate
phase.

Fig. 8. Timing diagram for our sorting algorithm’s write-evaluate phase.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 9. Detailed block diagram of our sorting algorithm’s read-sort phase.

Fig. 10. Timing diagram for our sorting algorithm’s read-sort phase.

All input elements follow the same sequential operation at
every rising clock edge. Fig. 7 illustrates a detailed block
diagram of the write-evaluate phase’s data path, which shows
the input bus and all control signals that are fed from the
control unit (Section V-B). Fig. 8 depicts the associated timing
diagram, which shows the detailed streamlined sequential
timing for the write-evaluate phase. In this diagram, the
START-EXT signal indicates the beginning of a new block
of N = 2^K K-bit input elements, which arrive sequentially
on each clock cycle. The START-EXT signal consecutively
triggers several intermediate signals in the write-evaluate data
path’s circuit. First, the reset signal RES is asserted high
for one clock cycle to initialize all registers (omitted from
Fig. 7 for figure clarity). Next, the WRITE-ENA signal is used
to direct the input data to the one-hot decoder and to enable the
clocking source for the order and flag register arrays, which
is additionally gated by an AND gate driven by the
one-hot decoder.

Following the timing diagram in Fig. 8, the write-evaluate
cycle time requires time for the one-hot decoder (T_oh), time
for the order and flag registers’ access times, (T_or) and (T_fr),
respectively, and time for the flag register increment (T_acc).
The total write-evaluate phase’s cycle time (T_write-cycle) is

$$T_{\text{write-cycle}} = T_{\text{oh}} + T_{\text{or}} + T_{\text{acc}} + T_{\text{fr}}. \tag{9}$$

The delay element’s components have no influence on the
write-evaluate cycle time since these components only change
the duty cycle while preserving the cycle time. All of the
registers (order and flag) are structured in parallel, such that
the access times to the registers are on the order of fractions
of a nanosecond. Additionally, the simple incrementor’s delay is less
than a nanosecond since the bit-width is only
K bits. One incrementor is shared for all flag registers since
only one element is input per clock cycle.

A parallel counter in the control unit (Section V-B) controls
the end of the write-evaluate phase when the counter’s value
reaches the maximum possible number of inputted elements
(i.e., N = 2^K). Even though the input set may contain
fewer than the maximum number of elements, assuming that
the input set is full preserves the simplicity of the read-sort
phase’s operation. The control unit asserts the READ-ENA
signal and deasserts the WRITE-ENA signal when the
write-evaluate phase completes, which enables the read-sort phase
on the next clock edge. The write-evaluate phase requires a
fixed N clock cycles since the phase always iterates for the
maximum number of potential input elements.
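
The bookkeeping performed during the write-evaluate phase can be modeled in software as follows. This is a behavioral Python sketch under stated assumptions, not a description of the circuit: the function name `write_evaluate` and its return values are hypothetical, one loop iteration stands for one clock cycle, the reset cycle is not modeled, and the input set is assumed to be full (N = 2^K elements), as the text assumes.

```python
def write_evaluate(inputs, k_bits):
    """Model the write-evaluate phase, one input element per clock cycle.

    Returns (order_regs, flag_regs, cycles). order_regs[i] holds value i once
    it has been seen (None otherwise); flag_regs[i] counts its occurrences.
    """
    n = 1 << k_bits
    if len(inputs) != n:
        raise ValueError("this sketch assumes a full input set of N = 2^K elements")
    order_regs = [None] * n
    flag_regs = [0] * n
    cycles = 0
    for value in inputs:
        enable = 1 << value              # one-hot decoder output
        index = enable.bit_length() - 1  # position of the single set bit
        order_regs[index] = value        # order register records the element
        flag_regs[index] += 1            # shared incrementor updates the flag
        cycles += 1
    return order_regs, flag_regs, cycles


order, flags, cycles = write_evaluate([2, 0, 2, 1], 2)
print(order)    # [0, 1, 2, None]
print(flags)    # [1, 1, 2, 0]
print(cycles)   # 4
```

The cycle count always equals N, independent of the input values, which mirrors the fixed N clock cycles stated above.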

2) Read-Sort Phase: Fig. 9 illustrates a detailed block
diagram of the read-sort phase’s data path, which comprises
a K-bit sorted shift register (SRi) array of size N that stores
the elements in their final sorted order, and a K-bit PC that
indexes into the order register array to process each element in
turn. The element ordering, ascending or descending, is
user-specified, and can be controlled by either left- or right-shifting
in the elements. A one-detector circuit detects if the flag
register value is “1” or not, and a decrementor circuit subtracts
a “1” from the flag register, the result of which is stored back
into the flag register, when processing replicated elements.
In this figure, the write-evaluate phase’s data path components
that are used in the read-sort phase are encompassed in the
dashed lines.

The read-sort phase begins after the WRITE-ENA signal
is deasserted and the READ-ENA signal is asserted, which
sends the PC’s value to the one-hot decoder at each new
read-sort clock cycle. The one-hot decoder converts this counter
value to the value’s one-hot representation, which enables the
associated order and flag registers to read/release the registers’
values, and the order register’s value is stored into the sorted
register array if-and-only-if that element’s flag register value
is greater than “0,” meaning there was at least one occurrence
of that input element. The one-detector evaluates the flag
register value to control whether or not the element is stored
in the sorted register array. If the flag register records a value
equal to or greater than “1,” the associated element should be
stored in the sorted register array a number of times equal
to the flag register’s value. The case is simple when the flag
register value is “1,” which is detected by the one-detector.
To avoid complex comparison units (i.e., equal to or greater
than “1”), values greater than “1” can be easily
detected using the decrementor’s carry-out signal. Thus,
if the one-detector’s evaluation is false (i.e., “0” is the
one-detector’s decision output), but the carry-out flag that
results from decrementing the flag register’s value is “0,” this means
that the flag register’s value was greater than “1.” In both
cases, the input element should be stored into the sorted
register array. Indexing to the next input element is inhibited
by disabling the PC’s increment, which allows the replicated
element to be stored in the sorted register array until the flag
register value reaches “0.” Otherwise, the flag register’s value
is “0,” the element is not in the input set, and thus is not stored
into the sorted register array, and the PC is incremented.
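
The decision logic of the read-sort phase can be sketched as a behavioral Python model. The function name `read_sort` is hypothetical, one loop iteration stands for one clock cycle, and the decrementor’s carry-out is modeled as a borrow that is set only when a flag value of “0” is decremented; that reading of the carry-out signal is an assumption made for this sketch. The inputs are the order and flag register contents left by the write-evaluate phase.

```python
def read_sort(order_regs, flag_regs):
    """Model the read-sort phase; returns (sorted_regs, cycles)."""
    flags = list(flag_regs)   # work on a copy of the flag registers
    sorted_regs = []
    cycles = 0
    pc = 0                    # parallel counter indexing the registers
    while pc < len(flags):
        flag = flags[pc]
        is_one = flag == 1    # one-detector output
        borrow = flag == 0    # modeled carry-out: set only when 0 is decremented
        if is_one:
            sorted_regs.append(order_regs[pc])  # single (or last) occurrence
            pc += 1
        elif not borrow:
            sorted_regs.append(order_regs[pc])  # flag > 1: store, hold the PC
            flags[pc] = flag - 1
        else:
            pc += 1                             # flag is 0: element is absent
        cycles += 1
    return sorted_regs, cycles


# Register contents for the input set [2; 0; 2; 1] from Section III.
result, cycles = read_sort([0, 1, 2, None], [1, 1, 2, 0])
print(result)   # [0, 1, 2, 2]
print(cycles)   # 5
```

The sketch appends in ascending order; the text notes that the ordering is user-specified through the shift direction of the sorted register array.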

The read-sort cycle time can be divided into three cases
based on the flag register’s value. For clarity, these cases will
be described with references to the example in Fig. 1 and
the discussion of the structure in Section III. In case one,
the flag register’s value is “0” (i.e., the element is not in the
binary matrix), and thus, this element is not stored in the sorted
register array, and the PC is incremented (i.e., proceed to the
next row in the transpose matrix). The timing of the
read-sort cycle (T_read-cycle) in case one is the sum of the PC’s
increment (T_pc), the one-hot decoder’s (T_oh), and the
one-detector’s (T_od) delays

$$T_{\text{read-cycle}} = T_{\text{pc}} + T_{\text{oh}} + T_{\text{od}}. \tag{10}$$

We can see that the one-detector and decrementor both operate
concurrently with the flag register value’s evaluation.

In case two, the flag register’s value is “1,” meaning that
the element is in the input set once, and thus this element is
read from the order register using the one-hot decoder and a
tri-state buffer at the register’s output, the element is stored in
the sorted register array, and the PC is incremented. As with
case one, a flag register value of “0” and “1” both require one
clock cycle. The timing of the read-sort cycle (T_read-cycle) in
this case is the sum of the PC’s increment (T_pc), the
one-hot decoder’s (T_oh), the one-detector’s (T_od), and the sorted
register array’s (T_sr) delays

$$T_{\text{read-cycle}} = T_{\text{pc}} + T_{\text{oh}} + T_{\text{od}} + T_{\text{sr}}. \tag{11}$$

In case three, the flag register’s value is greater than “1”
(i.e., the element’s corresponding row in the transpose matrix
contains more than one “1”). Similar to case two, this element
is stored into the sorted register array, but in this case, the flag
register is also decremented. The PC’s increment is disabled
until the element’s flag register reaches “1,” signaling that all
occurrences of the element have been stored into the sorted
output array. The timing of the read-sort cycle (T_read-cycle) in
this case is the sum of the PC’s increment (T_pc), the one-hot
decoder’s (T_oh), the decrementor’s (T_da), and the flag register
array’s (T_fr) delays

$$T_{\text{read-cycle}} = T_{\text{pc}} + T_{\text{oh}} + T_{\text{da}} + T_{\text{fr}}. \tag{12}$$


Fig. 10 shows the timing diagram for the read-sort phase for
all three cases, where the circled area shows the clock cycle
operations for cases two and three. Case three is assumed to be
the worst case because the decrementor’s delay ($T_{DA}$) is
larger than the one-detector’s delay ($T_{OD}$) used in case two.

The delays of the additional required logic gates, such as the
XOR gate, tri-state buffer, and AND gates, are not included in
the above delay equations since these gates contribute only
fractions of a nanosecond. Additionally, delay buffer #3 (Fig. 9)
has no effect on the read-sort cycle time since this delay
element is only used for maintaining the setup-hold time between
the clock (CLK) and the element being stored in the sorted
register array.

Case three represents the worst-case, upper-bound sorting
time, which occurs when the input element set contains N
occurrences of the same element (i.e., one row in the transpose
matrix has all “1” values, while all other rows have all “0”
values). The corresponding flag register’s value for this element
is “N,” while all other flag registers’ values are “0.” Our
algorithm requires N - 1 cycles to check all other flag register
values (i.e., all other transpose matrix rows), even though these
values are “0,” and N cycles to output the single replicated
element N times into the sorted register array. Therefore, the
total number of clock cycles is 2N - 1 plus one cycle for reset,
resulting in a total worst-case upper bound of 2N.

The best-case lower bound occurs when all elements in
the input set are distinct (i.e., every transpose matrix row
contains either a single “1” or no “1”s, corresponding to case
two and case one, respectively). During the read-sort phase,
each cycle either stores one element or nothing, respectively,
to the sorted register array, which requires N clock cycles to
sort N elements.

On average and in most general cases, the input set will
contain a mixture of distinct and repeated elements, and the
actual sorting time will fall between the upper and lower
bounds. Considering both the write-evaluate and read-sort
phases, the required number of clock cycles ranges from
2N to 3N to sort the input elements, with the addition of
one clock cycle for reset and one clock cycle for the control
switch between the write-evaluate and read-sort phases.
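
The cycle bounds above can be checked with a small behavioral model. The following Python sketch is not the hardware itself; it only mimics the flag-register bookkeeping described in this section, assuming that the write-evaluate phase takes one cycle per input element and that the read-sort phase spends one cycle per transpose-matrix row plus one extra cycle for each repeated occurrence. The function name `comparison_free_sort` and its return layout are illustrative choices, and the reset and phase-switch cycles are not counted.

```python
def comparison_free_sort(elements, k):
    """Behavioral model of the two phases for k-bit unsigned inputs.

    Returns (sorted_elements, write_cycles, read_cycles).
    """
    rows = 1 << k                      # one flag register per possible value
    flags = [0] * rows

    # Write-evaluate phase: one cycle per input element; the decoded
    # row's flag register counts the occurrences of that value.
    write_cycles = 0
    for value in elements:
        flags[value] += 1
        write_cycles += 1

    # Read-sort phase: the PC scans the rows in ascending order.
    sorted_elements = []
    read_cycles = 0
    pc = 0
    while pc < rows:
        read_cycles += 1
        if flags[pc] == 0:             # case one: nothing stored, PC advances
            pc += 1
        elif flags[pc] == 1:           # case two: store element, PC advances
            sorted_elements.append(pc)
            pc += 1
        else:                          # case three: store, decrement, hold PC
            sorted_elements.append(pc)
            flags[pc] -= 1
    return sorted_elements, write_cycles, read_cycles


if __name__ == "__main__":
    k = 3                              # N = 2**k = 8 rows
    print(comparison_free_sort([5, 2, 7, 0, 3, 6, 1, 4], k))
    print(comparison_free_sort([5] * 8, k))
```

For N = 8, the distinct input spends N = 8 read cycles, while eight copies of one value spend 2N - 1 = 15 read cycles, in line with the lower and upper bounds discussed above.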

B. Control Unit Operation 

The control unit receives input signals from the data path
and outputs the appropriate control signals back to the data
path. The control unit also receives the external and
handshaking components’ signals in order to interface with the
external components that are using the sorting hardware, and
synchronizes the complete sorting operation. There are several
methods for designing the control unit [54], [55], and prior
work on sorting hardware typically found it sufficient to
present only the data path design with no detail on the control
logic [2], [34]-[45]. However, in our work, we present the
complete control unit design in order to provide a holistic
sorting implementation with all signals, which avoids any
discrepancy between the control and data path units.
Additionally, our inclusion of the control unit’s design shows
the simplicity of our sorting hardware: the control unit uses a
small number of gates, is scalable, and is easily reconfigurable
to different data types and sizes. We note that further area
optimization can easily be achieved by reusing components for
many handshaking controls with the data path unit; however,
without loss of generality and for an easier conceptual
explanation, we describe the control unit without shared
components. With regard to timing and power, most of the
components in the control unit are fast and respond within the
DFF access time delay. Additionally, most of the DFFs are
clock-gated with an enable signal so that they switch only when
needed, thus reducing the overall circuit’s power consumption.

Fig. 11. Control unit diagram for the write-evaluate unit.

Fig. 12. Control unit diagram for the read-sort unit.

Collectively, Figs. 11 and 12 depict the complete block
diagrams for the control unit. For ease of explanation, we
divide the control logic structure into the write-evaluate and
read-sort phases’ controls, respectively; however, physically,
the control units share common components, such as the clock
and the reset-initialization block.

The write-evaluate control circuitry (Fig. 11) is derived
from the write-evaluate timing diagram (Fig. 8) and receives
as input the external signals CLOCK-EXT, RES-EXT, and
START-EXT. These signals control the sorting of the input
bus elements, such that the data path generates the sorted
elements on the output bus and signals the end of sorting by
asserting the FINAL-EXT signal. The internal
reset-initialization block is triggered by the START-EXT signal
and, in turn, asserts the RES signal for one clock cycle.
This complete clock cycle ensures that the reset-initialized
components receive the asserted RES signal for long enough
to ensure state initialization in the components, regardless of
the underlying technology and fan-out interconnect. Several
reset signals are branched and routed to different components
in order to minimize the effective load on the RES signal.
Additionally, the clock tree is designed to balance the clock
edges across the components and preserve the setup-hold time
margins; these details are omitted from the figure for clarity.

All input and output signals are associated with
appropriately-sized drivers to minimize the resistive-capacitive
load on the input signals and to ensure that the signals
propagate quickly enough and at full swing with an appropriate
signal slew rate. We refer the reader to [53] for further
details on load balancing and using appropriately-sized drivers.
Asserting the RES signal (after START-EXT is asserted) for one
clock cycle begins initializing the master-slave DFF structure
for further operations. Subsequently, de-asserting the RES
signal triggers the assertion of the WRITE-ENA signal for the
complete write-evaluate phase. Once the control unit’s PC
reaches the saturated state $N = 2^K$, all input elements have
been processed, which indicates the end of the write-evaluate
phase. The WRITE-ENA signal is de-asserted and the READ-ENA
signal is asserted on the next CLK edge, as illustrated in the
timing diagram in Fig. 8.

The read-sort phase’s control circuitry (Fig. 12) is
derived from the read-sort timing diagram (Fig. 10). The
READ-ENA signal is asserted one clock cycle after the WRITE-ENA
signal is de-asserted. At this point, the data path’s PC is
enabled and activates the one-hot decoder, order register array,
flag register array, and one-detector. When the data path’s PC
saturates (i.e., all order and flag register values have been
evaluated), the data path asserts the FINAL-STATE signal
that drives the control unit. The control unit de-asserts the
READ-ENA signal and asserts the FINAL-EXT signal, indicating
that sorting is complete. The FINAL-STATE signal indicates
that all rows in the transpose matrix have been scanned and
mapped to the sorted register array.
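
The signal sequencing described above can be summarized as a small state machine. The Python sketch below is a behavioral abstraction only, not the gate-level design of Figs. 11 and 12: the `Phase` names, the `SWITCH` state that models the one-cycle gap between WRITE-ENA and READ-ENA, and the helper functions are all hypothetical, introduced here for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    RESET = auto()    # RES asserted for one clock cycle
    WRITE = auto()    # WRITE-ENA asserted
    SWITCH = auto()   # one-cycle control switch between the phases
    READ = auto()     # READ-ENA asserted
    DONE = auto()     # FINAL-EXT asserted


def next_phase(phase, start_ext=False, write_pc_saturated=False,
               final_state=False):
    """Return the phase entered on the next clock edge."""
    if phase is Phase.IDLE:
        return Phase.RESET if start_ext else Phase.IDLE
    if phase is Phase.RESET:
        return Phase.WRITE
    if phase is Phase.WRITE:
        return Phase.SWITCH if write_pc_saturated else Phase.WRITE
    if phase is Phase.SWITCH:
        return Phase.READ
    if phase is Phase.READ:
        return Phase.DONE if final_state else Phase.READ
    return Phase.DONE


def control_signals(phase):
    """Control outputs asserted in a given phase."""
    return {
        "RES": phase is Phase.RESET,
        "WRITE-ENA": phase is Phase.WRITE,
        "READ-ENA": phase is Phase.READ,
        "FINAL-EXT": phase is Phase.DONE,
    }


def trace(write_cycles, read_cycles):
    """List the phase entered on each clock edge of one sorting run."""
    phase, history = Phase.IDLE, []
    written = read = 0
    while phase is not Phase.DONE:
        phase = next_phase(
            phase,
            start_ext=True,
            write_pc_saturated=(written == write_cycles),
            final_state=(read == read_cycles),
        )
        if phase is Phase.WRITE:
            written += 1
        elif phase is Phase.READ:
            read += 1
        history.append(phase)
    return history


if __name__ == "__main__":
    for p in trace(write_cycles=2, read_cycles=3):
        print(p.name, control_signals(p))
```

In this model a run with W write cycles and R read cycles occupies W + R + 2 clock cycles before FINAL-EXT, which reflects the one reset cycle and one control-switch cycle mentioned in the text.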

The synchronization of these operations is inherent by
design using DFFs with a SET and RESET structure, as given
in [59]. The complete control unit requires only seven DFFs
for controlling the continuous sorting of input elements. The
simplicity of our control unit circuitry is due to the
continuous forward flow of data through the data path, which
results in simple timing that is amenable to efficient circuit
design structures.

VI. Simulations and Results 

Without loss of generality and for comparison purposes,
we implemented, tested, and verified our sorting algorithm
and hardware architecture using a sample system with N = 1024
input data elements, which is similar to many prior hardware
sorting integrated circuits (ICs) [2], [37]-[45], [47]-[49].
We architected our proposed comparison-free sorting hardware
at the CMOS transistor level using 90-nm Taiwan Semiconductor
Manufacturing Company (TSMC) technology with a 1 V power
supply [56]. We gathered timing delay values, total
power consumption, and total transistor counts using HSPICE
simulations [57].
TABLE I

Component Time Delays and Transistor Counts Assuming 90-nm Technology

| Component | Size | Delay | Transistor Count |
|---|---|---|---|
| Binary order registers | 1024 registers, 10 DFFs per register | 0.14 ns | 204800 |
| Binary flag registers | 1024 registers, 10 DFFs per register | 0.14 ns | 204800 |
| Serial registers | 1024 registers, 10 DFFs per register | 0.14 ns (shift one element from one register to another) | 204800 |
| One-hot decoder | 10-to-1024, 5 levels | 0.68 ns | 5456 |
| One-detector | 10-bit input, 1-bit output, 2 levels | 0.26 ns | 24 |
| Incrementor/decrementor | 10-bit input, 10-bit output | 0.37 ns | 200 |
| Tri-state buffer (Ds) | 10 buffers per register | 0.14 ns | 10*1024*4*2 = 81920 |
| Parallel counter | 10-bit parallel-in, 10-bit parallel-out | 0.167 ns | 330*2 = 660 |
| Control logic | 7 DFFs | 0.14 ns | 7*24 = 168 |

The one-hot decoder, which converts the 10-bit input bus
binary representation to the 1024-bit one-hot weight
representation, uses four-input fan-in NAND logic gates in a
five-level hierarchical structure, resulting in a timing delay of
$T_{OH} = 0.688$ ns. The order and flag registers each comprise
ten parallel DFFs, such that the register access time can
be approximated using a single DFF access time of
$T_{DFF} = 0.14$ ns. Similarly, the tri-state buffer and
multiplexer delays are approximated as the DFF access time,
$T_{TB} = T_{MUX} = T_{DFF}$.

The one-detector uses a parallel prefix-tree structure of
four-input OR gates, which takes 10 bits as input and produces
its output in two levels, resulting in a timing delay of
$T_{OD} = 0.26$ ns. The data path’s 10-bit PC is implemented
based on state look-ahead logic [58], giving a timing delay to
the next state of approximately 0.167 ns. The
incrementor/decrementor circuit takes a 10-bit input bus and
adds or subtracts a “1,” giving a timing delay of approximately
0.37 ns.
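
As a rough illustration of two of the components just described, the Python sketch below models the one-hot decoder’s input/output behavior and counts the gate levels of a reduction tree. It assumes that the one-detector reduces its 10 input bits through a balanced tree of four-input gates; the function names `one_hot` and `reduction_levels` are hypothetical and are not taken from the design.

```python
def one_hot(value, k):
    """Return the 2**k-bit one-hot word for a k-bit value, as an int."""
    if not 0 <= value < (1 << k):
        raise ValueError("value does not fit in k bits")
    return 1 << value                  # exactly one bit set, at index `value`


def reduction_levels(width, fan_in=4):
    """Gate levels needed to reduce `width` bits with `fan_in`-input gates."""
    levels = 0
    while width > 1:
        width = -(-width // fan_in)    # ceiling division: gates at this level
        levels += 1
    return levels


if __name__ == "__main__":
    word = one_hot(3, 10)
    print(bin(word), bin(word).count("1"))
    print(reduction_levels(10))
```

Under this assumption, a 10-bit input needs three four-input gates in the first level and one gate in the second, which agrees with the two-level one-detector listed in Table I.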

Table I summarizes all of the components’ delay times
and associated transistor counts. These results, combined
with (9)-(12), show that the write-evaluate phase’s clock cycle
time is $CLK_W < 2$ ns and the read-sort phase’s clock cycle
time is $CLK_R < 2$ ns. These timings result in an approximate,
conservative clock frequency of 500 MHz, and the total power
consumption given the technology factor at this frequency is
1.6 mW. Sorting 1024 elements requires a total number of clock
cycles ranging from 2 × 1024 = 2048 to 3 × 1024 = 3072,
depending on the number of duplicated input elements, resulting
in a total time (for our clock speed of 500 MHz) of
approximately 4-6 µs. Additionally, the total transistor count
is less than 750 000 to sort 1024 elements.

Fig. 13. Transistor counts for the order, flag, and sorted register arrays as
compared to the number of elements.
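
The read-sort cycle times in (10)-(12) and the total sorting time can be reproduced from the Table I delays. The Python sketch below is only an arithmetic check: it assumes that the sorted and flag register array delays equal the 0.14 ns DFF access time and uses the 0.68 ns decoder delay listed in Table I. The variable and function names are illustrative.

```python
# Component delays in nanoseconds, taken from Table I.
T_PC = 0.167   # parallel counter (PC) delay to the next state
T_OH = 0.68    # one-hot decoder
T_OD = 0.26    # one-detector
T_DA = 0.37    # decrementor
T_SR = 0.14    # sorted register array (assumed equal to DFF access time)
T_FR = 0.14    # flag register array (assumed equal to DFF access time)

read_cycle_ns = {
    "case one, (10)": T_PC + T_OH + T_OD,
    "case two, (11)": T_PC + T_OH + T_OD + T_SR,
    "case three, (12)": T_PC + T_OH + T_DA + T_FR,
}

CLOCK_PERIOD_NS = 2.0   # conservative period, i.e., 500 MHz
N = 1024


def sort_time_us(cycles, period_ns=CLOCK_PERIOD_NS):
    """Convert a clock-cycle count to microseconds."""
    return cycles * period_ns / 1000.0


if __name__ == "__main__":
    for name, delay in read_cycle_ns.items():
        print(f"{name}: {delay:.3f} ns")
    print(f"best:  {sort_time_us(2 * N):.3f} us")
    print(f"worst: {sort_time_us(3 * N):.3f} us")
```

All three cases come out below the 2 ns clock period (about 1.11, 1.25, and 1.36 ns), and 2048 to 3072 cycles at 2 ns per cycle give roughly 4.1 to 6.1 µs.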

Our design avoids complex components such as memory
and pipelining structures, which are considered in hardware
designs as the bottleneck for performance and power
consumption [13]. The only design bottleneck with respect to
performance is the one-hot decoder; however, an optimized
version of this component could be used [51], [52]. Since
our focus is to architect a holistic circuit design, rather
than optimizing special components and leveraging advanced
CMOS technologies, we consider the integration of these
optimizations as orthogonal to our design.

Fig. 13 shows how the transistor count scales with the
number of data elements for the order, flag, and sorted
register arrays, since these structures dominate the transistor
count. These results show that our design’s transistor growth
rate is linear, with a small slope of less than six, giving a
linear complexity of O(N) with respect to transistor count.
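
The register-array counts can be recomputed from the figures in Table I. The Python sketch below assumes 20 transistors per DFF, a value inferred here from the table (204800 transistors for 1024 registers of 10 DFFs each) rather than stated in the text, and it treats the “serial registers” row as the sorted register array. The constant and function names are illustrative.

```python
TRANSISTORS_PER_DFF = 20   # inferred: 204800 / (1024 registers * 10 DFFs)
REGISTER_ARRAYS = 3        # order, flag, and sorted register arrays


def register_array_transistors(k):
    """Transistors in the three register arrays for N = 2**k k-bit elements."""
    n = 1 << k
    return REGISTER_ARRAYS * n * k * TRANSISTORS_PER_DFF


# Transistor counts for N = 1024, K = 10, copied from Table I.
TABLE_I_COUNTS = {
    "order registers": 204800,
    "flag registers": 204800,
    "serial (sorted) registers": 204800,
    "one-hot decoder": 5456,
    "one-detector": 24,
    "incrementor/decrementor": 200,
    "tri-state buffers": 81920,
    "parallel counters": 660,
    "control logic": 168,
}

if __name__ == "__main__":
    total = sum(TABLE_I_COUNTS.values())
    registers = register_array_transistors(10)
    print(total, registers, round(registers / total, 3))
    for k in (6, 10, 14):
        print(1 << k, register_array_transistors(k))
```

With these assumptions the Table I entries sum to 702828 transistors, below the 750 000 quoted above, and the three register arrays account for 614400 of them. The values for K = 6, 10, and 14 (23040, 614400, and 13762560) coincide with the Comp.-Free row of Table IV.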

Fig. 14 shows sorting speed in clock cycle time as compared
to the number of data elements $N = 2^K$ for a K-bit bus. Our
results ignore the interconnect parasitic values and the required
buffering sizes, and focus only on our design’s components’
delays. Using the access delay times reported in Table I
and (12) for upper-bound limits on maximum frequency, and
assuming the worst-case data distribution (all N elements are
repeated), Fig. 14 shows a linear complexity of O(N) for
end-to-end execution time for our sorting design with a small
growth rate of less than 1.5. This small rate is due to using
basic registers (flag, order, and sorted registers) that access
the bus in parallel.

TABLE II

Sorting Computation Time for an Input Set of 1024 Elements

| Algorithm | Platform | Average Sorting Time (µs) | Name |
|---|---|---|---|
| Ref. [5] | Single-core CPU | 708 | Quick sort |
| Ref. [15] | Many-core CPU | 300 | AA-sort |
| Ref. [19] | Single-core GPU | 100 | Min-max butterfly |
| Ref. [20] | Many-core GPU | 37 | Parallel merging |
| Proposed | ASIC | 6 | Comparison-free |

Fig. 15. Power consumption as compared to number of data elements.
(Horizontal axis: number of elements.)

The power consumption is relative to the switching activity
and the transistors’ static leakage. To reduce power
consumption, our design’s datapath and control units’ components
are gated with enable signals to restrict activity to only the
components’ operational periods. The write-evaluate and
read-sort phases each activate two register arrays: the order
and flag register arrays, and the flag and sorted register
arrays, respectively. Therefore, during the write-evaluate
phase, the sorted register array is shut off, and in the
read-sort phase, the order register array is shut off. All other
components operate in both phases; therefore, the phases consume
approximately equal power. Fig. 15 shows our design’s power
consumption as compared to the number of data elements, assuming
a 500 MHz running frequency. The operating frequency limits
our evaluation to a maximum of $N = 2^{16}$ data elements, since
larger sizes would require a slower clock frequency.
Our design’s power consumption shows a linear complexity
of O(N) for a number of data elements less than $2^{16}$ with a
growth rate of about 6.4.

Overall, our design shows a linear growth rate O(N) with
respect to total transistor count, end-to-end execution time, and
power consumption. This is in contrast to other works [2],
[35], [41], [48] that report a linear complexity of O(N), but
with a growth rate that is usually greater than 100.

We also compare our design with data reported in the
literature for related CPU and GPU sorting algorithms [5], [15],
[19], [20]. Table II reports the execution time for sorting
1024 elements using both single- and multicore CPUs and
GPUs, considering only the computation time and not the
front-end memory initialization time or the back-end memory
merging time. These results show that, to the best of our
knowledge, our design is faster than prior algorithms that
effectively harness the computing resources.

For general purposes, we have compared our sorting design
with prior work with respect to hardware complexity and
sorting performance in number of clock cycles. These
comparisons are independent of technology factors in order to
avoid uncertainty with respect to different technology scale
comparisons and technology simulation environments. This
makes the comparison fair because technology circuit
implementations can vary greatly, ranging from different FPGA
varieties/families to custom application-specific integrated
circuits using CMOS, NMOS, PMOS, Domino, pass-transistor
logic families, and many others [53]. These implementation
specifics have a large influence on the design performance
and design cost, which may result in unrealistic or inaccurate
conclusions. Therefore, we compare our design with prior
designs with respect to common features for sorting hardware
design circuit architectures, such as the number of cycles
with respect to the number of input elements, the design
structure of the data path and control units that leads to
scalability and flexibility for different applications, and
finally, the design computation complexity and data movement
directions, which impact the design cost and power factor.
These types of comparisons provide a larger evaluation picture
considering the huge number of sorting hardware designs.

TABLE III

Comparison Between Prior Work and Our Proposed Sorting Design

| Sorting Design | Cycle Ranges | Complexity & Power Ratio |
|---|---|---|
| [44] | Not linear: for N < 200, 2N cycles + initialization time | Comparisons, pipelining, memory, parallelisms, large power ratio |
| [26] | Not linear: for N < 100, 3N to 4N cycles | Comparisons, pipelining, memory, parallelisms, large power ratio |
| [45] | (N + K - 1) cycles + initialization time + output resorting time | Comparisons, pipelining, memory, large power ratio |
| [47] | Not linear: (N + 2K) cycles + initialization time | Comparisons, memory, pipelining, moderate power ratio |
| [31] | 4N to 5N cycles + initialization time | Comparisons, pipelining, memory, parallelisms, large power ratio |
| Proposed | 2N to 3N cycles | No comparison, no memory, no pipelining, low power ratio |

Table III compares our design with prior hardware sorting
algorithms that have a single computing engine and several
sorting partitions that require merging small sorted partitions
to obtain the final sorted output. We evaluated the designs
based on the number of clock cycles required to sort an
input set of size N. This evaluation illustrates the
complexity scaling of our simple forward data flowing design
for increasing bit-widths as compared to the prior methods
that merge the datapath and control units’ functionalities
within the parallel computing cells, memory, and comparison
circuitry, all of which usually dictate the circuit’s design
complexity (number of transistors), runtime complexity (number
of cycles to sort N elements), and power. Dividing
computing cells that integrate the datapath with the control
unit usually requires two operations: element evaluation and
result updating, which requires repeating evaluation decisions.
Furthermore, prior rank-based designs required repeated ALU
computations within the SRAM or memory array, which is
usually characterized as being time consuming.

TABLE IV

Comparison With Recent FPGA Sorting Algorithms: Spiral [47] and Resolve [48]

| FPGA design | N = 64: Trans. Count | N = 64: BRAM | N = 64: Time | N = 1,024: Trans. Count | N = 1,024: BRAM | N = 1,024: Time | N = 16,384: Trans. Count | N = 16,384: BRAM | N = 16,384: Time |
|---|---|---|---|---|---|---|---|---|---|
| Spiral SN1 | 2,382,300 | 10 | 0.5 | 35,553,750 | 162 | 8.2 | - | - | - |
| Spiral SN2 1 | 1,150,450 | 5 | 3.2 | 6,011,850 | 45 | 82 | 1.0 × 10^8 | 964 | 1835 |
| Spiral SN2 S | 2,401,400 | 10 | 0.5 | 17,202,850 | 125 | 8.2 | 17.8 × 10^8 | 1395 | 131 |
| Spiral SN5 | 4,126,300 | 18 | 0.5 | 31,058,900 | 225 | 8.2 | - | - | - |
| Resolve | 1,123,400 | 2 | 0.54 | 2,599,900 | 7 | 8.3 | 16,434,350 | 70 | 131.1 |
| Comp.-Free | 23,040 | - | 0.4 | 614,400 | - | 6.1 | 13,762,560 | - | 98.3 |

For additional comparison, we evaluate the data reported
in [49], which presents recent work on hardware sorting
algorithms implemented on the Xilinx FPGA xc7vx690tffg1761-2
using 32-bit fixed point operations and running at a frequency
of 125 MHz. Table IV shows the overall transistor counts,
required number of BRAMs, and sorting time in microseconds.
These compared designs show a linear increase in the FF/LUT
count with respect to the number of elements; however, the BRAM
requirements do not scale linearly. Since memory devices
introduce performance bottlenecks, this results in the
non-linear execution time and non-linear transistor count.

With respect to all evaluated results, our comparison-free
sorting design provides an efficient linear scalability of O(N).
Our design uses simple registers (flag, order, and sorted
registers) that are accessed on both the rising and falling
clock edges, and simple standard CMOS components with
a forward flowing data movement architecture. Even though
our design shows a linear performance cost of O(N), our
hardware design is recommended for data element set sizes of
less than $2^{16}$ due to practical integration into large
computing IC devices (e.g., graphics engines, routers, grid
controllers), where the sorting hardware accounts for no more
than 10% of the IC’s characteristics (power and area).

VII. Conclusion 

In this paper, we proposed a novel mathematical
comparison-free sorting algorithm and associated hardware
implementation. Our sorting design exhibits linear complexity
O(N) with respect to the sorting speed, transistor count, and
power consumption. This linear growth is with respect to the
number of elements N for $N = 2^K$, where K is the bit width
of the input data. The slope of the linear growth rate is small,
with a growth rate of approximately 6 for the transistor count
and power consumption, and 1.5 for the sorting speed.

The order complexity and growth rates are due to
simple basic circuit components that alleviate the need
for SRAM-based memory and pipelining complexity. Our
mathematically-simple algorithm streamlines the sorting
operation in one forward flowing direction rather than using
compare operations and frequent data movement between the
storage and computational units, as with other sorting
algorithms. Our design uses simple standard library components
including registers, a one-hot decoder, a one-detector, an
incrementor/decrementor, and a PC, combined with a simple control
unit that contains a small amount of delay logic.

Our design is at least 6× faster than software parallel
algorithms that harness powerful computing resources for
input data set sizes in the small-to-moderate range up to
$2^{16}$. Additionally, our hardware design’s performance is
approximately 1.5× better as compared to other optimized
hardware-based hybrid sorting designs in terms of transistor
count and design scalability, number of clock cycles and
critical path delay, and power consumption. Thus, our design is
suitable for most IC systems that require sorting algorithms as
part of their computational operations.

Our results show that our comparison-free sorting CMOS
hardware can sort N unsigned integer elements from end-to-end
with any input data set distribution within 2N to 3N
clock cycles (lower and upper bounds, respectively) at a clock
frequency of 0.5 GHz using a 90-nm TSMC technology with
a 1 V power supply and a power consumption of 1.6 mW for
N = 1024 elements.

Future work includes leveraging our sorting algorithm for
commercial parallel processing computing power, such as
GPUs and parallel processing machines, in order to further
improve large-scale sorting, and thus, further enhance
embedded sorting for big data applications.